Preamble

This is an IPython Notebook to accompany the paper entitled 'The Architecture of Genome Wide Association Studies' by Melinda Mills and Charles Rahal. This allows us to share data analysis done in the construction of this data-driven review of all GWAS to date. It can also be used to dynamically replicate the analysis herein going forward. For installation guidelines, please see the readme.md which accompanies this repository. A preliminary point to note is that if you wish to run this .ipynb file itself, you may need to override the default settings of some versions of Jupyter Notebook (4.2 to 5.1) by opening with:

jupyter notebook --NotebookApp.iopub_data_rate_limit=10000000000

or editing your jupyter_notebook_config.py file. Lets begin by loading in our favourite python modules, ipython magic and a bunch of custom functions we've written specifically for this project:

In [1]:
import networkx as nx
import os
import pandas as pd
import re
import seaborn as sns
import sys
import numpy as np
import csv
import itertools
import warnings
import gender_guesser.detector as gender
from Bio import Entrez
from IPython.display import HTML, display
from Support.LoadData import (
    EFO_parent_mapper,
    load_gwas_cat,
    load_pubmed_data, 
    make_timely,
    make_clean_CoR,
    download_cat
)
from Support.PubMed import (
    build_collective,
    build_author,
    build_funder,
    build_abstract,
    build_citation
)
from Support.Analysis import (
    simple_facts,
    ancestry_parser,
    make_meanfemale_andranks,
    make_funders,
    mapped_trait_summary
)
from Support.Figures import (
    gwas_growth,
    choropleth_map,
    wordcloud_figure,
    plot_heatmap,
    plot_bubbles,
    boxswarm_plot
)
from Support.Additional import clean_names
from Support.Ancestry import ancestry_cleaner

warnings.filterwarnings('ignore')
%config InlineBackend.figure_format = 'png'
%matplotlib inline
%load_ext autoreload
%autoreload 2

Then, let's dynamically grab the three main curated datasets from the GWAS Catalogue EBI website that we will need for our endeavours ('All Associations','All Studies', 'All Ancestries') and the EFO to Parent trait mapping file from their FTP site:

In [2]:
output_path = os.path.abspath(os.path.join('__file__',
                                    '../..',
                                    'data',
                                    'Catalogue',
                                    'Raw'))
ebi_download = 'https://www.ebi.ac.uk/gwas/api/search/downloads/'
download_cat(output_path, ebi_download)

Lets link the PUBMEDID variable to the PUBMED API and get a series of datasets from that using the support functions written in PubMed.py. Note: collective corresponds mostly consortia and study groups.

In [3]:
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
id_list = set(Cat_Studies['PUBMEDID'].astype(str).tolist())
Entrez.email = 'your@email.com'
papers = Entrez.read(Entrez.efetch(db='pubmed',retmode='xml', id=','.join(id_list)))
build_collective(papers)
build_author(papers)
build_funder(papers)
build_abstract(papers)
build_citation(id_list,Entrez.email)
Number of Collectives Found: 1546!
Authors with last names but no forenames: 7 out of 119948
Built a database of Authors from list of PUBMEDID IDs!
Built a database of Funders from list of PUBMEDID IDs!
Built a database of Abstracts from list of PUBMEDID IDs!
Processing for Citations: : 100%|████████████████████████████████████████████████| 3557/3557 [2:25:41<00:00,  2.46s/it]

Introduction

Some simple 'facts'

Lets do some basic descriptive analysis to see what we've got here:

In [4]:
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
simple_facts(Cat_Studies, Cat_Ancestry,
             Cat_Ancestry_groupedbyN, Cat_Full,
             AuthorMaster, mostrecentgwascatdate='2018-08-29')
There are currently: 3557 GWAS papers published (in terms of unique PUBMEDIDs)
The first observed GWAS in the catalogue was dated: 2005-03-10
However, only: 10 papers were published in 2005 and 2006 combined
There are currently: 5678 unique Study Accessions
There are currently: 3397 unique Diseases/Traits studied
These correspond to: 2447 unique EBI "Mapped Traits"
The total number of Associations found is currently: 81947
The average number of Associations is currently: 14.4
Mean P-Value for the strongest SNP risk allele is currently: 1.4166e-06
The number of associations reaching the 5e-8 threshold is 43890
The journal to feature the most GWAS studies since 2005-03-10 is: Nat Genet
However, in 2017, Nat Commun published the largest number of GWAS papers
So far, in 2018, Nat Genet published the largest number of GWAS papers
Largest Accession to date: 808380.0 people.
This was published in Nat Commun.
The first author was: Helgadottir A.
Total number of SNP-trait associations is 81947.
Total number of journals publishing GWAS is 459.
The study with the largest number of authors has: 559 authors.
The current publication lag in the Catalog is: 86 days
The lag between the date it was most recently published and date  of last included GWAS: 65 days

Can the abstracts give us any clues?

Lets make a nice infographic and do some basic NLP on the abstracts from all PUBMED IDs. This figure is the header on the readme.md:

In [5]:
abstract_count = os.path.abspath(
                 os.path.join('__file__', '../..', 'data', 'PUBMED',
                              'Pubmed_AbstractCount.csv'))
output_file = os.path.abspath(
              os.path.join('__file__', '../../', 'figures', 'pdf',
                           'helix_wordcloud_1250_5000_black.pdf'))
wordcloud_figure(abstract_count, output_file)

The file Pubmed_AbstractCount in PUBMED (subdirectory of Data) details a breakdown of the abstract counts. Check counts for 'European', 'Asian', etc.

The growth in depth and scope of GWAS over time

We can also visualise the ever increasing sample sizes and the popularity of GWAS. The top panel shows the increasing number of GWAS conducted over time. We also see increasingly large sample sizes: i.e. the fraction of 'Big N' sample size studies increases. The bottom left subplot shows that with this growth comes bigger N and a high correlatation with the number of significant Associations found. The right subplot shows the unique number of journals publishing GWAS per year and the unique number of diseases or traits studied. Both steadily increase as the appeal of GWASs broadens.

In [6]:
output_file = os.path.abspath(
              os.path.join('__file__', '../../', 'figures', 'svg',
                           'Figure_1.svg'))
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
gwas_growth(output_file, Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN)

Participants

Ancestry

This section of analysis only uses data contained within the catalogue, but is slightly deeper than Section 2 above and other related papers in the literature. We use the 'Broad Ancestral Category' field and aggregate from 135 combinations of 17 ancestries to 7 'broader ancestral categories' to calculate a comparable measure to Popejoy and Fullerton. To get a more detailed analysis of polyvocality, we load in some of our supporting functions to clean up the "INITIAL SAMPLE SIZE" and "REPLICATION SAMPLE SIZE" free-text fields in the catalogue and then calculate something analogous to Panofsky and Bliss. The supporting functions are based on regular expression and data wrangling techniques which exploit patterns in the these free-text fields. By way of a very simple example: “19,546 British ancestry individuals from 6863 families.” will get cleaned to two seperate fields: “19,546” and “British” which can then be used for further downstream analysis. A slightly more complex example: “2,798 European ancestry individuals, 228 French Canadian founder population individuals” will correspond to two entries of 2798 and 228 in the new ‘Cleaned N’ type variable, corresponding to ‘European’ and ‘French Canadian’ in the ‘Cleaned Ancestry’ type variable respectively. Uncomment the appropriate lines of text to get the full lists.

In [7]:
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
Cat_Studies['InitialClean'] = Cat_Studies.apply(
    lambda row: ancestry_cleaner(row, 'INITIAL SAMPLE SIZE'), axis=1)
output_path = os.path.abspath(
              os.path.join('__file__',
                           '../..',
                           'data',
                           'Catalogue',
                           'Synthetic',
                           'new_initial_sample.csv'))
ancestry_parser(output_path, 'InitialClean', Cat_Studies)
Cat_Studies['ReplicationClean'] = Cat_Studies.apply(
    lambda row: ancestry_cleaner(row, 'REPLICATION SAMPLE SIZE'), axis=1)
output_path = os.path.abspath(
              os.path.join('__file__',
                           '../..',
                           'data',
                           'Catalogue',
                           'Synthetic',
                           'new_replication_sample.csv'))
ancestry_parser(output_path, 'ReplicationClean', Cat_Studies)
In [8]:
clean_intial = pd.read_csv(os.path.abspath(
                           os.path.join('__file__', '../..', 'data',
                                        'Catalogue', 'Synthetic',
                                        'new_initial_sample.csv')),
                           encoding='utf-8')
clean_initial_sum = pd.DataFrame(
    clean_intial.groupby(['Cleaned Ancestry']).sum())
clean_initial_sum.rename(
    columns={'Cleaned Ancestry Size': 'Initial Ancestry Sum', }, inplace=True)
clean_initial_count = clean_intial.groupby(['Cleaned Ancestry']).count()
clean_initial_count.rename(
    columns={'Cleaned Ancestry Size': 'Initial Ancestry Count', }, inplace=True)
clean_initial_merged = clean_initial_sum.merge(pd.DataFrame(
    clean_initial_count['Initial Ancestry Count']), how='outer', left_index=True,
    right_index=True)
clean_initial_merged = clean_initial_merged.sort_values(
    by='Initial Ancestry Sum', ascending=False)
holder = ''
for index, row in clean_initial_merged.iterrows():
    holder = holder + \
        str(index) + ' (' + str(row['Initial Ancestry Sum']) + \
        ',' + str(row['Initial Ancestry Count']) + '), '
print('There are a total of ' + str(len(clean_initial_merged)) +
      ' ancestries found in the \'INITIAL SAMPLE SIZE\' column.' +
      '\nThese are: (number of people used used in all studies, ' +
      'number of studies included):\n\n' + holder[:-2] + '.\n\n')
There are a total of 207 ancestries found in the 'INITIAL SAMPLE SIZE' column.
These are: (number of people used used in all studies, number of studies included):

European (86308821,5775), Japanese (6483362,582), British (3927349,50), Icelandic (2290333,20), African American (1621477,711), East Asian (1595237,203), Finnish (918380,157), African (670041,159), Korean (559893,241), Hispanic (538177,250), Han Chinese (537566,383), South Asian (496698,52), Hispanic/Latino (281971,62), Asian (281048,140), Chinese (274747,165), Indian Asian (181580,44), Latin American (162910,42), African American/Afro Caribbean (161018,36), Indian (127862,52), Hispanic/Latino American (116226,12), African American or Afro Caribbean (102828,10), Hispanic or Latino (83557,4), European and Asian (79845,2), Dutch (75049,20), Taiwanese Han Chinese (72100,14), Swedish (67990,23), Sardinian (66472,24), European American (63371,37), Filipino (61239,41), Northern European (59894,17), Latino (54928,41), Pomak (53847,31), Bangladeshi (48462,34), Hispanic/Latin American (48109,7), Rucphen (46182,26), Mylopotamos (45756,31), Hispanic or Latin American (44659,10), Mexican (42743,25), Malay (41887,24), German (39536,15), Pakistani (38610,11), European and East Asian (35524,2), European and South Asian (30482,2), Orcadian (30241,27), Mexican American (29291,42), Asian Indian (26798,9), Black (24820,29), Ashkenazi Jewish (23674,34), European and Hispanic (22604,1), European and African American (22593,2), Old Order Amish (21338,40), Italian (20264,12), Val Borbera (19754,15), French (18714,10), European and Indian Asian (17967,1), Saudi Arab (17030,8), Singaporean Chinese (14920,24), South East Asian (14708,12), Northern Finnish (14289,3), Hispanic American (13835,16), Northwestern European (13483,2), British and Irish (11375,2), Western European (11043,3), African British (10206,4), Sub Saharan African (9377,25), Brazilian (8761,19), Ugandan (8583,9), African American and African (8298,2), Danish (8247,4), Mexican and Latin American (8214,2), Thai (8143,18), Korkulan (7706,11), Turkish (7638,14), French Canadian (7508,13), Austrian (7407,9), Puerto Rican (7096,9), Sorbian (6982,8), Mainland Hispanic/Latino (6734,1), Fruili Venezia Giulia (6634,7), European and  Rucphen (6597,1), Vietnamese (6469,4), Polish (6413,13), Lebanese (6308,6), Norwegian (5704,6), Caribbean Hispanic (5607,4), Caribbean Hispanic/Latino (5458,1), Gambian (5038,4), Amish (5035,10), Latino American (4912,4), South Tyrol (4728,5), Latino or African (4658,9), Punjabi Sikh (4619,5), Southern Chinese (4582,3), Indo European (4238,4), Kosraen (4168,2), Hutterite (3776,9), American Indian (3704,12), Carlantino (3667,10), Nigerian (3564,3), West African (3434,5), South Indian (3265,4), Pima Indian (3251,8), Afro Caribbean (3201,4), Taiwanese (3104,6), Samoan (3072,1), African American or European (3054,2), Tibetan (3043,2), Tanzanian (3041,5), African American/African Caribbean (3015,4), Kenyan (2857,2), Native American North American (2835,2), North Indian (2692,4), Sereer (2640,5), Micronesian (2346,1), Celtic (2307,1), Oceania (2229,2), Cilento (2163,3), Indonesian (2041,5), Sirente (2002,7), Middle Eastern (1950,2), Black Hispanic (1940,4), Native Hawaiian (1902,6), Moroccan (1851,9), Japanese American (1697,3), African American and African Caribbean (1612,1), Surinamese (1553,7), African and African Arab (1215,4), Spanish (1193,4), Sri Lankan Sinhalese (1154,4), Slavic (1102,1), Southern European (1090,1), Latin/South American (1050,2), Yoruban (1047,3), Split (986,2), African and Asian (968,6), African Caribbean (929,2), Ogliastran (897,1), Ethiopian (895,6), Native American South American (875,2), African Arab (853,4), Arab (836,6), Mongolian (756,1), Martu Australian Aboriginal (752,3), Central European (744,1), South African (733,2), Uyghur (721,1), Tyrolean (696,1), Mixed (684,1), South African Black (637,2), Thai Chinese (618,2), Papua New Guinean (576,2), Bulgarian (564,2), European American and Mexican American (540,1), Vietnamese Korean (518,1), Costa Rican (506,2), Jewish (497,2), Isreali Arab (496,6), Middle Eastern/North African (482,4), Taiwanese Han (470,2), Talana (470,1), Irish (432,2), Malaysian Chinese (371,2), Russian (366,6), Iranian (352,2), Silk Road (348,1), Finnish Saami (347,1), Jewish Isreali (331,2), North African (329,7), Plains American Indian (322,1), Fijian Indian (319,2), Hong Kong Chinese (315,1), Norfolk Island (285,2), Greater Middle Eastern (281,2), Asian American (261,8), South African Afrikaner (251,2), Hispanic/Native American (237,1), Tatar (231,2), Southern Indian (229,2), Malawian (226,2), Tunisian (221,2), Portuguese (221,2), Asian or Pacific Islander (207,3), Arab Isreali (189,1), Hispanic and European (180,2), Hispanic/Native (164,1), Bashkir (161,2), Han Chinese American (156,2), Hispanic Caucasian (136,2), Nepalese (99,2), Oriental (90,1), Native American (89,7), Solomon Islander (85,2), Uygur Chinese (83,1), Dai Chinese (58,1), European and Mexican (41,2), Jingpo Chinese (40,1), Romanian (32,1), European/Asian (27,2), Caucasian Eastern Mediterranean (26,2), Native Hawaiian or Pacific (14,2), Hui Chinese (11,1), Cape Verdian (4,2), Native American or Alaska Native (4,2), Curacao (2,2), Dominican Republic (2,2), Isreali (2,2), Afghanistan (2,2).


In [9]:
clean_replication = pd.read_csv(os.path.abspath(
                                os.path.join('__file__',
                                             '../..',
                                             'data',
                                             'Catalogue',
                                             'Synthetic',
                                             'new_replication_sample.csv')),
                                encoding='utf-8')
clean_replication_sum = pd.DataFrame(
    clean_replication.groupby(['Cleaned Ancestry']).sum())
clean_replication_sum.rename(
    columns={'Cleaned Ancestry Size': 'Replication Ancestry Sum', }, inplace=True)
clean_replication_count = clean_replication.groupby(
    ['Cleaned Ancestry']).count()
clean_replication_count.rename(
    columns={'Cleaned Ancestry Size': 'Replication Ancestry Count', }, inplace=True)
clean_replication_merged = clean_replication_sum.merge(
    pd.DataFrame(clean_replication_count['Replication Ancestry Count']),
    how='outer', left_index=True, right_index=True)
clean_replication_merged = clean_replication_merged.sort_values(
    by='Replication Ancestry Sum', ascending=False)
holder = ''
for index, row in clean_replication_merged.iterrows():
    holder = holder + \
        str(index) + ' (' + str(row['Replication Ancestry Sum']) + \
        ',' + str(row['Replication Ancestry Count']) + '), '
print('There are a total of ' + str(len(clean_replication_merged)) +
      ' ancestries found in the \'REPLICATION SAMPLE SIZE\' column.' +
      ' These are (number of people used used in all studies, number of' +
      ' studies included):\n\n' + holder[:-2] + '.')
There are a total of 148 ancestries found in the 'REPLICATION SAMPLE SIZE' column. These are (number of people used used in all studies, number of studies included):

European (36222204,3025), Asian (3387710,110), East Asian (2129170,237), Icelandic (1181163,17), Japanese (1059831,292), Han Chinese (1023973,254), African American (753265,350), Chinese (417161,181), Hispanic (397124,146), South Asian (394398,60), Korean (306162,146), British (251083,3), African (192176,76), European and Asian (136010,2), Indian Asian (118255,14), African American/Afro Caribbean (104015,48), European and African American (99227,9), European and East Asian (87155,5), Finnish (76236,23), Hispanic/Latino (62435,12), Indian (50713,47), Hispanic/Latino American (43200,6), European and Middle East/North African (29807,2), African America/Afro Caribbean (27661,1), Mexican (27148,17), Sardinian (25831,13), Latino (25299,17), Sub Saharan African (23752,26), European and  Rucphen (22789,1), Malay (22036,27), Brazilian (21631,17), Filipino (21546,23), African American and African (20809,1), Dutch (20367,3), Black and European (19090,1), Mexican American (18703,24), Northern European (16962,3), Ashkenazi Jewish (15801,16), European and Indian Asian (13615,1), Asian Indian (12906,7), Indo European (12659,5), African American and Afro Caribbean (11583,3), Southern Chinese (11058,3), Old Order Amish (10887,12), Punjabi Sikh (10261,5), Thai (10032,25), African American or Afro Caribbean (8972,3), Turkish (8855,11), Hispanic/Latin American (8622,5), European American (8614,3), Rucphen (8337,7), Mexican/Latino (8214,2), Saudi Arab (7692,12), Pima Indian (7495,9), American Indian (7039,7), French Canadian (6846,15), Swedish (6212,3), Pakistani (6115,6), German (5569,4), Vietnamese (5490,4), Indonesian (5253,7), African British (5103,2), Silk Road (5017,11), West African (4780,7), Southern Indian (4600,2), Middle Eastern (4591,5), Danish (4258,2), Black (4111,12), Singaporean (3968,2), North Indian (3956,4), Russian (3950,2), Singaporean Chinese (3580,2), Gambian (3463,2), Dravidian (3260,4), Seychellois (3258,14), She Chinese (3235,1), Iranian (3210,9), Northwestern European (2942,2), Hispanic / Latino (2826,1), Talana (2508,3), Amish (2429,4), African American and African Caribbean (2147,1), Arab (2120,13), Samoan (2102,1), European and Ashkenazi Jewish (2035,2), African American and European (1893,1), Polish (1810,9), Costa Rican (1778,3), South and Central American (1769,1), Cilento (1709,2), Hispanic and European (1682,2), North African (1566,10), Polynesian (1536,2), Singaporean Indian (1520,1), Nepalese (1478,6), Caribbean Hispanic (1390,6), Latin American (1386,6), Oceania (1372,5), Southern European (1296,1), Native Hawaiian (1277,2), Hutterite (1255,2), Moroccan (1247,3), Latino American (1242,2), Indigenous Mexican (1178,2), European and Hispanic (1174,4), Jamaican (1089,2), Orcadian (1035,2), Afro Caribbean (962,1), Fruili Venezia Giulia (957,2), Val Borbera (910,1), Uygur Kazakh Chinese (840,2), Asian American (670,2), Spanish (662,2), Sorbian (659,1), Mongolian (542,2), Portuguese (525,2), South African Black (481,2), Mayan (475,2), Bulgarian (450,2), Ethiopian (433,6), Malaysian Chinese (420,2), Taiwanese (400,1), Middle Eastern Arab (352,2), Jewish (344,2), Carlantino (322,1), Afro Caribbean and Sub Saharan African (321,1), Surinamese (284,1), Hispanic and Native American (282,1), French (276,1), Western European (253,1), Malaysian (212,2), Arab Isreali (198,2), Tibetan (161,1), Yoruban (146,2), Uyghur (99,2), Ashkenazi Jewish Isreali (96,2), Hispanic American (85,1), Jewish Isreali (74,2), European/Asian (74,2), African America and European (68,1), Italian (45,2), African and Asian (38,2), Aboriginal Canadian (15,2), Native Hawaiian or Pacific (7,1), Black African (5,1), Native American or Alaska Native (4,1), South African (2,1), Colored African (1,1).

Lets aggregate to create a dataframe for natives/indigenous people using a dictionary based classifier

In [10]:
native_dict = pd.read_csv(os.path.abspath(os.path.join('__file__', '../..',
                                         'data', 'Support',
                                         'native_classifier_dictionary.csv')),
                            index_col='Cleaned Ancestry')
native_df = pd.merge(clean_replication_merged, clean_initial_merged,
                   left_index=True,right_index=True, how='outer')
if len(np.setdiff1d(native_df.index.values, native_dict.index.values))>0:
    print('There are still ancestries which need classifying as unique!'+
         ' Add them to the dictionary: ' +
         ', '.join(np.setdiff1d(native_df.index.values,native_dict.index.values)))
else:
    print('Congratulations! The Native/Indigenous dictionary is fully up to date!')
native_df = native_df.fillna(0)
native_df['Total Ancestry Sum'] = native_df['Replication Ancestry Sum'] + \
                                  native_df['Initial Ancestry Sum']
native_df['Total Ancestry Count'] = native_df['Replication Ancestry Count'] + \
                                    native_df['Initial Ancestry Count']
native_df = pd.merge(native_df,native_dict,left_index=True, right_index=True)
print('We observe ' + str(len(native_df)) + ' unique terms for ancestry/ethnicity in total.')
print('Of these, ' + str(len(native_df[native_df['Native Population']!='Not Native'])) + 
      ' relate to native or Indigenous populations ancestry/ethnicity.')
Congratulations! The Native/Indigenous dictionary is fully up to date!
We observe 235 unique terms for ancestry/ethnicity in total.
Of these, 19 relate to native or Indigenous populations ancestry/ethnicity.
In [11]:
print('These relate to ' +
      str(len(native_df[native_df['NativeType']=='Indicated']['Native Population'].unique())) + 
      ' specific native/indigenous groups as implied by the free text.')
print('These are: '+
     ', '.join(native_df[native_df['NativeType']=='Indicated']['Native Population'].unique()))
print('Out of ' + str(int(native_df['Total Ancestry Count'].sum())) +
     ' mentions of ancestry\ethnicity, ' +
     str(int(native_df[native_df['NativeType']=='Indicated']['Total Ancestry Count'].sum())) + 
     ' relate to native/indigenous groups as implied by the free text.' +
     '('+
      str(round(native_df[native_df['NativeType']=='Indicated']['Total Ancestry Count'].sum()/
      native_df['Total Ancestry Count'].sum()*100,3)) +'%)')
print('Out of ' + str(int(native_df['Total Ancestry Sum'].sum())) +
     ' participants, ' +
     str(int(native_df[native_df['NativeType']=='Indicated']['Total Ancestry Sum'].sum())) + 
     ' are of native/indigenous groups as implied by the free text.'  +
     '('+
      str(round(native_df[native_df['NativeType']=='Indicated']['Total Ancestry Sum'].sum()/
      native_df['Total Ancestry Sum'].sum()*100,3)) +'%)')
print('Most commonly used natives implied by the free text is: ' +
     native_df[native_df['NativeType']=='Indicated']['Total Ancestry Sum'].sort_values(ascending=False).index.values[0] +
     ' which contributed ' + 
      str(int(native_df[native_df['NativeType']=='Indicated']['Total Ancestry Sum'].sort_values(ascending=False)[0]))+
     ' people in terms of sample.')
These relate to 8 specific native/indigenous groups as implied by the free text.
These are: Aboriginal Canadians, Hispanic/Native American, Indigenous Mexicans, Indigenous Australian, Native Americans, Indigenous peoples of South America, Native Americans or Alaska Natives, Native Hawaiians
Out of 16105 mentions of ancestry\ethnicity, 34 relate to native/indigenous groups as implied by the free text.(0.211%)
Out of 158738617 participants, 9353 are of native/indigenous groups as implied by the free text.(0.006%)
Most commonly used natives implied by the free text is: Native Hawaiian which contributed 3179 people in terms of sample.
In [12]:
print('These relate to ' +
      str(len(native_df[native_df['NativeType']!='Not Native']['Native Population'].unique())) + 
      ' specific native/indigenous implied by the free text or which are manually classified.')
print('These are: '+
     ', '.join(native_df[native_df['NativeType']!='Not Native']['Native Population'].unique()))
print('Out of ' + str(int(native_df['Total Ancestry Count'].sum())) +
     ' mentions of ancestry\ethnicity, ' +
     str(int(native_df[native_df['NativeType']!='Not Native']['Total Ancestry Count'].sum())) + 
     ' relate to native/indigenous populations implied by the free text or which are manually classified.' +
     '('+
      str(round(native_df[native_df['NativeType']!='Not Native']['Total Ancestry Count'].sum()/
      native_df['Total Ancestry Count'].sum()*100,3)) +'%)')
print('Out of ' + str(int(native_df['Total Ancestry Sum'].sum())) +
     ' participants, ' +
     str(int(native_df[native_df['NativeType']!='Not Native']['Total Ancestry Sum'].sum())) + 
     ' are of native/indigenous populations implied by the free text or which are manually classified.' +
     '('+
      str(round(native_df[native_df['NativeType']!='Not Native']['Total Ancestry Sum'].sum()/
      native_df['Total Ancestry Sum'].sum()*100,3)) +'%)')
print('Most commonly used natives implied by the free text or which are manually classified is: ' +
     native_df[native_df['NativeType']!='Not Native']['Total Ancestry Sum'].sort_values(ascending=False).index.values[0] +
     ' which contributed ' + 
      str(int(native_df[native_df['NativeType']!='Not Native']['Total Ancestry Sum'].sort_values(ascending=False)[0]))+
     ' people in terms of sample.')
These relate to 16 specific native/indigenous implied by the free text or which are manually classified.
These are: Aboriginal Canadians, Native American, Pacific Islander, Bashkirs, Saami, Hispanic/Native American, Indigenous Mexicans, Indigenous Australian, Mayan, Native Americans, Indigenous peoples of South America, Native Americans or Alaska Natives, Native Hawaiians, Pima Indians, Plains Indians, Samoan
Out of 16105 mentions of ancestry\ethnicity, 81 relate to native/indigenous populations implied by the free text or which are manually classified.(0.503%)
Out of 158738617 participants, 37528 are of native/indigenous populations implied by the free text or which are manually classified.(0.024%)
Most commonly used natives implied by the free text or which are manually classified is: Pima Indian which contributed 10746 people in terms of sample.

Lets first now do a quick check that our 'Broader' dictionary is up to date:

In [13]:
if len(Cat_Ancestry[Cat_Ancestry['Broader'].isnull()]) > 0:
    print('Perhaps we need to update the dictionary for new terms? Something like these ones:\n' +
         '\n'.join(Cat_Ancestry[Cat_Ancestry['Broader'].isnull()]['BROAD ANCESTRAL'].unique()))
else:
    print('Nice! Looks like our Broad to Broader dictionary is up to date!')
Nice! Looks like our Broad to Broader dictionary is up to date!

Lets put this number into tables to get a deeper understanding, first including a row which has at least one ancestry as NR, and then dropping all rows in which at least one recorded ancestry is NR:

In [14]:
Broad_Ancestral_Full_initial = pd.DataFrame(
    Cat_Ancestry[Cat_Ancestry.STAGE ==
                 'initial'].groupby(['Broader'])['N'].sum())
Broad_Ancestral_Full_initial.rename(
    columns={'N': 'N (Initial)'}, inplace=True)
Broad_Ancestral_Full_replication = pd.DataFrame(
    Cat_Ancestry[Cat_Ancestry.STAGE ==
                 'replication'].groupby(['Broader'])['N'].sum())
Broad_Ancestral_Full_replication.rename(
    columns={'N': 'N (Replication)'}, inplace=True)
Broad_Ancestral_Full = pd.merge(Broad_Ancestral_Full_initial,
                                Broad_Ancestral_Full_replication,
                                left_index=True, right_index=True)
Broad_Ancestral_Full['Total'] = Broad_Ancestral_Full['N (Initial)'] + \
    Broad_Ancestral_Full['N (Replication)']
Broad_Ancestral_Full['% Discovery'] = (
    Broad_Ancestral_Full['N (Initial)'] /
    Broad_Ancestral_Full['N (Initial)'].sum()) * 100
Broad_Ancestral_Full['% Replication'] = (
    Broad_Ancestral_Full['N (Replication)'] /
    Broad_Ancestral_Full['N (Replication)'].sum()) * 100
Broad_Ancestral_Full['% Total'] = (Broad_Ancestral_Full['Total'] /
                                   Broad_Ancestral_Full['Total'].sum()) * 100
Broad_Ancestral_Full.to_csv(os.path.abspath(
                            os.path.join('__file__',
                                         '../..',
                                         'tables',
                                         'Broad_Ancestral_Full.csv')))
Broad_Ancestral_Full
Out[14]:
N (Initial) N (Replication) Total % Discovery % Replication % Total
Broader
African 361817 136576 498393 0.307131 0.251228 0.289480
African Am./Caribbean 2258708 1015713 3274421 1.917324 1.868380 1.901869
Asian 11018150 9013536 20031686 9.352851 16.580188 11.634927
European 94342989 37652655 131995644 80.083851 69.261175 76.666524
Hispanic/Latin American 1502967 665219 2168186 1.275806 1.223655 1.259339
In Part Not Recorded 7762814 5022605 12785419 6.589531 9.238964 7.426106
Other/Mixed 557815 856988 1414803 0.473506 1.576409 0.821755

Drop the 'in part not recorded' rows (i.e. where the Broad Ancestral Category contains at least one NR):

In [15]:
Broad_Ancestral_NoNR = Broad_Ancestral_Full[['N (Initial)',
                                             'N (Replication)',
                                             'Total']]
Broad_Ancestral_NoNR = Broad_Ancestral_NoNR.drop('In Part Not Recorded')
Broad_Ancestral_NoNR['Total'] = Broad_Ancestral_NoNR['N (Initial)'] + \
    Broad_Ancestral_NoNR['N (Replication)']
Broad_Ancestral_NoNR['% Discovery'] = (
    Broad_Ancestral_NoNR['N (Initial)'] /
    Broad_Ancestral_NoNR['N (Initial)'].sum()) * 100
Broad_Ancestral_NoNR['% Replication'] = (
    Broad_Ancestral_NoNR['N (Replication)'] /
    Broad_Ancestral_NoNR['N (Replication)'].sum()) * 100
Broad_Ancestral_NoNR['% Total'] = (Broad_Ancestral_NoNR['Total'] /
                                   Broad_Ancestral_NoNR['Total'].sum()) * 100
Broad_Ancestral_NoNR.to_csv(os.path.abspath(
                            os.path.join('__file__', '../..', 'tables',
                                         'Broad_Ancestral_NoNR.csv')))
Broad_Ancestral_NoNR
Out[15]:
N (Initial) N (Replication) Total % Discovery % Replication % Total
Broader
African 361817 136576 498393 0.328798 0.276802 0.312701
African Am./Caribbean 2258708 1015713 3274421 2.052579 2.058571 2.054434
Asian 11018150 9013536 20031686 10.012636 18.267958 12.568260
European 94342989 37652655 131995644 85.733272 76.311574 82.816570
Hispanic/Latin American 1502967 665219 2168186 1.365807 1.348216 1.360361
Other/Mixed 557815 856988 1414803 0.506909 1.736879 0.887674

What are the broad ancestral fields before we visualise the broader field? Dropping all rows for multiple (comma separated entries):

Lets now make some bubble plots to visualise the raw broad ancestry field and our fields extracted from the free text strings:

In [16]:
GroupAnc = Cat_Ancestry[(Cat_Ancestry['BROAD ANCESTRAL'] != 'Other') &
                        (Cat_Ancestry['BROAD ANCESTRAL'].str.count(',') == 0)].\
    groupby(['BROAD ANCESTRAL'])['N'].sum().to_frame()
GroupAnc['Individuals (%)'] = (GroupAnc['N'] /
                               GroupAnc['N'].sum()) * 100
GroupAnc.sort_values(by='Individuals (%)',ascending=False)[0:10]
Out[16]:
N Individuals (%)
BROAD ANCESTRAL
European 131995644 78.398921
East Asian 16041195 9.527681
NR 10565256 6.275242
African American or Afro-Caribbean 3274421 1.944845
Hispanic or Latin American 2168186 1.287796
Asian unspecified 2027497 1.204234
South Asian 1674794 0.994745
African unspecified 333154 0.197877
South East Asian 116675 0.069299
Sub-Saharan African 104798 0.062245

Now lets visualise the occurances of the 'Broader' ancestry conversion:

In [17]:
countriesdict = {'African': '#4daf4a', 'African Am./Caribbean': '#984ea3',
                 'Asian': '#e41a1c', 'European': '#377eb8',
                 'Hispanic/Latin American': '#ff7f00', 'Other/Mixed': '#ffff33'}
output_path = os.path.abspath(os.path.join('__file__', "../..", "figures", "svg",
                                             "Figure_2.svg"))
plot_bubbles(output_path, Cat_Ancestry, Broad_Ancestral_NoNR, countriesdict)
print('The biggest accession is ' + str(Cat_Ancestry['N'].max()))
The biggest accession is 586626

We can also analyze how these aggregates change over time, and this was a key feature of Popejoy and Fullerton (2016). We can provide a much more granular arguement (merely remove '_NoNR' from the cell below to undertake an equivlent analysis with rows which contain some in part NR ancestries):

In [18]:
index = [x for x in range(2007, 2019)]
columns = ['European', 'Asian', 'African', 'Hispanic/Latin American',
           'Other/Mixed', 'African Am./Caribbean']
Cat_Ancestry_NoNR = Cat_Ancestry[
    Cat_Ancestry['Broader'] != 'In Part Not Recorded']
Broad_Ancestral_Time_NoNR_PC = pd.DataFrame(index=index, columns=columns)
for year in range(2007, 2019):
    for broaderancestry in Broad_Ancestral_NoNR.index.tolist():
        Broad_Ancestral_Time_NoNR_PC[broaderancestry.strip()][year] =\
            (Cat_Ancestry_NoNR[(
                Cat_Ancestry_NoNR['DATE'].str.contains(str(year))) &
                (Cat_Ancestry_NoNR['Broader'] ==
                 broaderancestry)]['N'].sum() /
             Cat_Ancestry_NoNR[
             (Cat_Ancestry_NoNR['DATE'].str.contains(
                 str(year)))]['N'].sum()) * 100
Broad_Ancestral_Time_NoNR_PC.to_csv(os.path.abspath(
                            os.path.join('__file__', '../..', 'tables',
                                         'Broad_Ancestral_Time_NoNR_PC.csv')))
Broad_Ancestral_Time_NoNR_PC.head(12)
Out[18]:
European Asian African Hispanic/Latin American Other/Mixed African Am./Caribbean
2007 95.4688 2.13984 0.00542792 0.715245 1.18391 0.486807
2008 95.2884 2.94597 0 0.00180063 1.21894 0.544872
2009 88.1708 7.10488 0.258137 0.224117 3.36046 0.881605
2010 86.6601 10.04 0.270719 0.0591476 2.47158 0.49843
2011 78.2316 15.8233 0.15679 0.397565 1.70889 3.68185
2012 71.9735 19.4701 0.313032 0.884615 2.87484 4.4839
2013 82.1979 11.6865 0.394273 0.787581 0.617715 4.31598
2014 76.6056 18.6152 0.250182 1.15204 0.97794 2.39904
2015 88.0646 9.11528 0.280238 0.738889 0.510574 1.29045
2016 90.5885 4.75047 0.171282 1.49728 1.12096 1.87149
2017 88.1227 6.2475 0.569588 2.29738 0.661675 2.10111
2018 70.5509 26.003 0.256689 1.42186 0.105128 1.6624

We can focus on the initial discovery stage to calculate the number of individuals required per ancestry to unconver one hit. Note however that this does require some distributional assumptions (i.e. that the same diseases are being studied across ancestries).

In [19]:
Cat_Ancestry_Initial = Cat_Ancestry[Cat_Ancestry['STAGE'] == 'initial']

Cat_Ancestry_NoDups = Cat_Ancestry_Initial.drop_duplicates(
    subset=['STUDY ACCESSION'],
    keep=False)[['PUBMEDID', 'STUDY ACCESSION',
                 'Broader','N']]
Cat_Ancestry_NoDups_merge = pd.merge(Cat_Ancestry_NoDups,
                                     Cat_Studies[['ASSOCIATION COUNT',
                                                  'STUDY ACCESSION',
                                                  'MAPPED_TRAIT']],
                                     how='left', on='STUDY ACCESSION')

listtoiterate = ['European', 'Other/Mixed', 'Asian','African',
                 'African Am./Caribbean', 'Hispanic/Latin American']
for ancestry in listtoiterate:
    temp = Cat_Ancestry_NoDups_merge[Cat_Ancestry_NoDups_merge[
        'Broader'].str.strip() == ancestry]
    print('The number of ' + ancestry + 's required to find one hit: ' +
          str(round(1 /
                    (temp['ASSOCIATION COUNT'].sum() / temp['N'].sum()), 1)))
The number of Europeans required to find one hit: 1558.8
The number of Other/Mixeds required to find one hit: 482.5
The number of Asians required to find one hit: 1369.1
The number of Africans required to find one hit: 779.0
The number of African Am./Caribbeans required to find one hit: 436.5
The number of Hispanic/Latin Americans required to find one hit: 393.4
In [20]:
 Cat_Ancestry_Initial['Broader'].unique()
Out[20]:
array(['European', 'Other/Mixed', 'Asian', 'In Part Not Recorded',
       'African Am./Caribbean', 'Hispanic/Latin American', 'African'],
      dtype=object)

Choropleth map of country of recruitment

This is a choropleth map of country of recruitment - note that this field is not quite fully populated in the Catalogue. Basemap code is loosely based on this. As we want to study the number of people recruited from each country, we can only utilize values in the ‘Country of Recruitment’ field when only one country is mentioned. For example: if N=100,000 and Country of Recruitment = {U.K., U.S.}, there is no way for us to know what the breakdown between the two countries is in the Catalogue (especially as the free text field may just contain 'European' ancestry). Our only option in this scenario is to drop such observations. This forces us to drop about 22% of all the studies, and this corresponds to about 48% of the Catalogue as measured by N (because bigger studies have multiple countries of recruitment). We are faced with multiple entries equal to ‘NR’, which corresponds to ‘not recorded’. This reduces the number of studies to go into the map by a further 21% (of the initial total), and this means that the map only includes about half of all GWAS studies. This has no relationship to whether ancestry is coded.

In [21]:
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
Clean_CoR = make_clean_CoR(Cat_Ancestry)
Cleaned_CoR_N = pd.DataFrame(Clean_CoR.groupby(['Cleaned Country'])['N'].sum())
Cleaned_CoR_N['Rank_N'] = Cleaned_CoR_N.rank(
    ascending=False, method='dense').astype(int)
Cleaned_CoR_count = pd.DataFrame(
    Clean_CoR.groupby(['Cleaned Country'])['N'].count())
Cleaned_CoR_count['Rank_count'] = Cleaned_CoR_count.rank(
    ascending=False, method='dense').astype(int)
Cleaned_CoR_count = Cleaned_CoR_count.rename(columns={'N': 'Count'})
Cleaned_CoR = pd.merge(Cleaned_CoR_count, Cleaned_CoR_N,
                       left_index=True, right_index=True)
del Cleaned_CoR.index.name
Cleaned_CoR['Count (%)'] = round((Cleaned_CoR['Count'] /
                                   Cleaned_CoR['Count'].sum()) * 100,2)
Cleaned_CoR['N (%)'] = round((Cleaned_CoR['N'] /
                               Cleaned_CoR['N'].sum()) * 100,2)
countrylookup = pd.read_csv(os.path.abspath(
                            os.path.join('__file__',
                                         '../..',
                                         'data',
                                         'ShapeFiles',
                                         'Country_Lookup.csv')),
                            index_col='Country')
country_merged = pd.merge(countrylookup,
                          Cleaned_CoR,
                          left_index=True,
                          right_index=True)
country_merged['Per Rec'] = round(country_merged['N']/pd.to_numeric(country_merged['2017population']),2)
country_merged.sort_values('Rank_N', ascending=True)[
    ['Continent', 'Count', 'N', 'Count (%)', 'N (%)', 'Per Rec']].to_csv(
    os.path.abspath(os.path.join('__file__',
                                 '../..',
                                 'tables',
                                 'CountryOfRecruitment.csv')))
country_merged.sort_values('Rank_N', ascending=True)[
    ['Continent', 'Count', 'N', 'Count (%)', 'N (%)', 'Per Rec']].head(10)
Cleaning for single country of recruitment: 72.342%  obs remain. 
This represents about 55.283% of the total GWAS N. 
When we drop for country==NR, we lose another: 24.62% papers.
6108 obs for this field remain out of a total of 11201 rows of Cat_Anc data
Out[21]:
Continent Count N Count (%) N (%) Per Rec
United Kingdom Europe 653 21294574 10.69 40.00 0.32
United States North America 2548 10549181 41.72 19.82 0.03
Japan Asia 460 7456048 7.53 14.01 0.06
Iceland Europe 71 6409109 1.16 12.04 19.13
China Asia 478 2034701 7.83 3.82 0.00
Finland Europe 208 1176077 3.41 2.21 0.21
Korea, South Asia 252 840876 4.13 1.58 0.02
Netherlands Europe 175 663477 2.87 1.25 0.04
Germany Europe 170 417234 2.78 0.78 0.01
Australia Oceania 108 319142 1.77 0.60 0.01
In [22]:
output_path = os.path.abspath(os.path.join('__file__',
                                           '../..',
                                           'figures',
                                           'svg',
                                           'Figure_3.svg'))
choropleth_map(country_merged, 'N', 'OrRd', output_path)
In [23]:
output_path = os.path.abspath(os.path.join('__file__',
                                           '../..',
                                           'figures',
                                           'svg',
                                           'Figure_3_Blues.svg'))
choropleth_map(country_merged, 'Per Rec', 'Blues', output_path)

Lets now merge that via a country lookup to continent based file using data from within the shapefile itself (based on CIAWF).

In [24]:
country_merged_sumcount = country_merged[[
    'Continent', 'Count']].groupby(['Continent']).sum()
country_merged_sumN = country_merged[[
    'Continent', 'N']].groupby(['Continent']).sum()
country_merged_sums = pd.merge(
    country_merged_sumN, country_merged_sumcount,
    left_index=True, right_index=True)
country_merged_sums['N (%)'] = (
    country_merged_sums['N'] / country_merged_sums['N'].sum()) * 100
country_merged_sums['Count (%)'] = (
    country_merged_sums['Count'] / country_merged_sums['Count'].sum()) * 100
country_merged_sums.to_csv(os.path.abspath(
                           os.path.join('__file__',
                                        '../..',
                                        'tables',
                                        'ContinentOfRecruitment.csv')))
country_merged_sums
Out[24]:
N Count N (%) Count (%)
Continent
Africa 61021 59 0.114633 0.965946
Asia 10952637 1498 20.575425 24.525213
Europe 31192652 1762 58.597949 28.847413
North America 10667907 2642 20.040536 43.254748
Oceania 331082 117 0.621965 1.915521
Seven seas (open ocean) 1389 1 0.002609 0.016372
South America 24957 29 0.046884 0.474787

Who funds GWAS and what do they fund?

A simple descriptive tabulation

Lets do a super simple tabulation of the Top 20 GWAS funders (measured by 'GWAS contributions': i.e. a funder funding an authors time on a paper).

In [25]:
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
AllFunders = FunderInfo.groupby(by='Agency').count()
AllFunders.index.name = None
AllFunders = AllFunders.reset_index()
AllFunders = AllFunders.rename(columns={
                               'index': 'Agency',
                               'PUBMEDID': 'Grant Contributions',
                               'GrantCountry': 'Country'})
AllFunders_withcountries = pd.merge(AllFunders[['Agency',
                                                'Grant Contributions']],
                                    FunderInfo[['Agency',
                                                'GrantCountry']].drop_duplicates(
                                        'Agency'),
                                    on='Agency', how='left')
AllFunders_withcountries = AllFunders_withcountries.set_index('Agency')
AllFunders_withcountries.index.name = None
AllFunders_withcountries['% of Total'] = round((
    AllFunders_withcountries['Grant Contributions'] /
    AllFunders_withcountries['Grant Contributions'].sum()) * 100, 2)
AllFunders_withcountries.sort_values(
    'Grant Contributions', ascending=False)[0:20]
Out[25]:
Grant Contributions GrantCountry % of Total
NHLBI NIH HHS 12485 United States 25.83
NCI NIH HHS 5151 United States 10.66
NIA NIH HHS 4034 United States 8.35
MRC 3497 United Kingdom 7.24
NIDDK NIH HHS 2675 United States 5.53
NIMH NIH HHS 2656 United States 5.50
Wellcome Trust 1801 United Kingdom 3.73
NHGRI NIH HHS 1789 United States 3.70
NCRR NIH HHS 1556 United States 3.22
NIAID NIH HHS 1190 United States 2.46
PHS HHS 1141 United States 2.36
NIAMS NIH HHS 1059 United States 2.19
NIAAA NIH HHS 905 United States 1.87
NINDS NIH HHS 750 United States 1.55
WHI NIH HHS 743 United States 1.54
NIDA NIH HHS 698 United States 1.44
NCATS NIH HHS 683 United States 1.41
Cancer Research UK 603 United Kingdom 1.25
NIGMS NIH HHS 596 United States 1.23
Intramural NIH HHS 585 United States 1.21

Lets print out some simple descriptives from this data:

In [26]:
print('There are ' + str(len(FunderInfo['Agency'].drop_duplicates())) +
      ' unique funders returned from PubMed Central.')
print('There are ' + str(len(FunderInfo['GrantID'].drop_duplicates())) +
      ' unique grants returned from PubMed Central.')
print('There are ' + str(len(FunderInfo['GrantCountry'].drop_duplicates())) +
      ' unique grant countries returned from PubMed Central.')
print('Each study has an average of ' +
      str(round(len(FunderInfo) / len(id_list), 2)) + ' grants funding it.')
grantgroupby = FunderInfo.groupby(['Agency', 'GrantID']).size().groupby(
    level=1).max().sort_values(ascending=False).reset_index()
print('The most frequently acknowledged grant is GrantID ' +
      grantgroupby['GrantID'][0] + '.\nThis grant is acknowledged ' +
      str(grantgroupby[0][0]) + ' times.')
There are 98 unique funders returned from PubMed Central.
There are 12656 unique grants returned from PubMed Central.
There are 7 unique grant countries returned from PubMed Central.
Each study has an average of 13.59 grants funding it.
The most frequently acknowledged grant is GrantID P30 DK063491.
This grant is acknowledged 205 times.

International distribution of funders

From which countries do these grants come from?

In [27]:
TopCountryFunders = FunderInfo.groupby(by='GrantCountry').count()
TopCountryFunders = TopCountryFunders.rename(
    columns={'PUBMEDID': 'Number Of Studies'})
TopCountryFunders = TopCountryFunders.sort_values(
    'Number Of Studies', ascending=False)[['Number Of Studies']]
TopCountryFunders = TopCountryFunders.reset_index().rename(
    columns={'GrantCountry': 'Country of Funder'})
TopCountryFunders['Percent Funded (%)'] = (
    TopCountryFunders['Number Of Studies'] /
    TopCountryFunders['Number Of Studies'].sum()) * 100
TopCountryFunders = TopCountryFunders.set_index('Country of Funder')
TopCountryFunders.index.name = None
TopCountryFunders.head(10)
Out[27]:
Number Of Studies Percent Funded (%)
United States 41108 85.122067
United Kingdom 6934 14.358189
Canada 175 0.362371
International 64 0.132524
Austria 7 0.014495
Italy 5 0.010353

What about the commas?

Lets first get a metric about splitting on the commas in the EFO field. We can make identical heatmaps by just replacing EFO_Parent_Mapped with EFO_Parent_Mapped_NoCommas. This is inherently making the assumption that a GWAS which has two mentioned EFO contributes equally to each of them: as much as a GWAS with one EFO (i.e. no comma to split on) contributes to that one specific EFO.

In [28]:
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
EFO_Parent_Mapped, EFO_Parent_Mapped_NoCommas = EFO_parent_mapper(
    Cat_Studies, Cat_Ancestry_groupedbyN)
print('There are ' + str(len(EFO_Parent_Mapped)) +
      ' rows of EFO terms after we split for commas.')
print('This indicates ' + str(len(EFO_Parent_Mapped) -\
                              len(EFO_Parent_Mapped_NoCommas)) +
      ' additional terms were mapped than for when we drop csvs.')
print('This indicates ' +
      str(len(EFO_Parent_Mapped['EFO term'].drop_duplicates())) +
      ' unique EFO terms to map to Parents.')
print('This is in comparison to ' +
      str(len(EFO_Parent_Mapped_NoCommas['EFO term'].drop_duplicates())) +
      ' unique EFO terms in Cat_Studies.')
There are 7830 rows of EFO terms after we split for commas.
This indicates 3579 additional terms were mapped than for when we drop csvs.
This indicates 1928 unique EFO terms to map to Parents.
This is in comparison to 1354 unique EFO terms in Cat_Studies.

What do they study?

In [29]:
mapped_trait_summary(EFO_Parent_Mapped, 'EFO term')
mapped_trait_summary(EFO_Parent_Mapped_NoCommas,'EFO term')
                             Sample   Studies  Associations
EFO term                                                   
Type II Diabetes Mellitus  1.731211  1.240885      0.701573
Body Mass Index            3.741513  1.202507      2.756953
Schizophrenia              0.913502  1.087374      1.169689
Breast Carcinoma           0.976628  0.985033      0.769564
Bipolar Disorder           0.365675  0.972240      0.339354
Unipolar Depression        2.541905  0.946655      0.453676
Alzheimers Disease         0.672238  0.946655      0.482557
HD LC measurement          0.799611  0.882692      1.128172
Triglyceride Meas.         0.779878  0.754765      1.074020
Colorectal Cancer          0.651778  0.754765      0.285804
Asthma                     0.328829  0.754765      0.330931
Prostate Carcinoma         0.699407  0.716387      0.399523
LD LC measurement          0.708513  0.703595      0.985571
Rheumatoid Arthritis       0.622737  0.588461      0.336346
Coronary Heart Disease     0.504370  0.575668      0.161855
Bone Density               0.289917  0.562876      0.921190
Diastolic Blood Pressure   3.052374  0.537291      1.150435
                             Sample   Studies  Associations
EFO term                                                   
Type II Diabetes Mellitus  2.793371  1.510146      1.361446
Alzheimers Disease         0.627830  1.250590      0.463328
Body Mass Index            2.866597  1.085418      2.548305
Breast Carcinoma           1.320781  1.085418      1.947322
HD LC measurement          1.107727  0.991034      1.024023
Prostate Carcinoma         1.008553  0.991034      0.834327
Colorectal Cancer          0.818280  0.943841      0.579160
Schizophrenia              0.757797  0.943841      1.903675
Bone Density               0.455278  0.873053      0.936729
LD LC measurement          0.997368  0.849457      0.856150
Asthma                     0.488390  0.755073      0.429754
Triglyceride Meas.         1.056322  0.731477      0.696671
Bipolar Disorder           0.341295  0.731477      0.317279
Unipolar Depression        1.971920  0.707881      0.747033
Rheumatoid Arthritis       0.456100  0.660689      0.812503
Body Height                1.292059  0.660689      1.378234
Parkinson'S Disease        0.921442  0.613497      0.394500
In [30]:
mapped_trait_summary(EFO_Parent_Mapped, 'Parent term')
mapped_trait_summary(EFO_Parent_Mapped_NoCommas, 'Parent term')
                              Sample    Studies  Associations
Parent term                                                  
Other Meas.                31.918073  29.691698     40.340437
Neurological Disorder       9.251026  10.579506      5.085500
Other Disease               4.772978   7.701164      3.095103
Cancer                      6.282496   6.767302      3.231688
Response To Drug            1.981252   6.293975      4.006667
Lipid/Lipoprotein Meas.     3.422785   4.963541      6.164936
Other Trait                 3.609500   4.873993      2.118557
Digestive System Disorder   3.207677   4.349495      2.401353
Cardiovascular Disease      7.166260   4.157605      1.829745
Immune System Disorder      2.814744   3.671485      2.334565
Biological Process          4.448970   3.607522      4.244335
Cardiovascular Meas.        3.735394   3.454010      3.435661
Body Meas.                  8.890780   3.249328     10.172204
Hematological Meas.         4.616052   2.788794      5.413423
Metabolic Disorder          2.528164   2.468978      0.983766
Inflammatory Meas.          0.890594   0.921069      3.669118
Liver Enzyme Meas.          0.463255   0.460535      1.472942
                              Sample    Studies  Associations
Parent term                                                  
Other Meas.                28.382415  28.574799     32.933237
Neurological Disorder       9.064098  11.538462      7.958838
Other Disease               5.469642   8.352997      4.274035
Cancer                      7.743437   8.022652      6.201212
Cardiovascular Disease      9.170862   4.813591      3.666337
Lipid/Lipoprotein Meas.     4.479362   4.813591      4.650070
Other Trait                 3.771132   4.672015      3.055280
Immune System Disorder      2.704053   4.436055      4.284108
Digestive System Disorder   2.673115   4.365267      4.304252
Cardiovascular Meas.        4.546126   3.869750      2.832010
Hematological Meas.         6.871096   3.680982     12.795246
Biological Process          2.854415   3.350637      2.845440
Body Meas.                  7.474555   3.209061      5.758029
Metabolic Disorder          3.223076   2.501180      1.866743
Response To Drug            0.125511   2.477584      1.185180
Inflammatory Meas.          0.759807   0.873053      0.990448
Liver Enzyme Meas.          0.687297   0.448325      0.399537

Who's funding what?

In [31]:
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
EFO_Parent_Mapped, EFO_Parent_Mapped_NoCommas = EFO_parent_mapper(
    Cat_Studies, Cat_Ancestry_groupedbyN)

EFO_Parent_Mapped['Parent term'] = \
    EFO_Parent_Mapped['Parent term'].str.replace('Disorder', '')

FunderInfo_Parent = pd.merge(FunderInfo, EFO_Parent_Mapped,
                             left_on='PUBMEDID', right_on='PUBMEDID',
                             how='left')

funder_ancestry, funder_parent = make_funders(FunderInfo_Parent, FunderInfo, Cat_Ancestry)
funder_parent.to_csv(os.path.abspath(os.path.join('__file__',
                                           '../..',
                                           'tables',
                                           'Funders_and_Ancestry.csv')))
funder_ancestry.to_csv(os.path.abspath(os.path.join('__file__',
                                           '../..',
                                           'tables',
                                           'Funders_and_BroadEFO.csv')))
output_path = os.path.abspath(os.path.join('__file__',
                                           '../..',
                                           'figures',
                                           'svg',
                                           'Figure_4.svg'))
plot_heatmap(funder_ancestry, funder_parent, output_path)

Lets do the same thing, but instead drop not split on the EFO commas:

In [32]:
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
EFO_Parent_Mapped, EFO_Parent_Mapped_NoCommas = EFO_parent_mapper(
    Cat_Studies, Cat_Ancestry_groupedbyN)

EFO_Parent_Mapped_NoCommas['Parent term'] = \
    EFO_Parent_Mapped_NoCommas['Parent term'].str.replace('Disorder', '')

FunderInfo_Parent = pd.merge(FunderInfo, EFO_Parent_Mapped_NoCommas,
                             left_on='PUBMEDID', right_on='PUBMEDID',
                             how='left')

funder_ancestry, funder_parent = make_funders(FunderInfo_Parent, FunderInfo, Cat_Ancestry)
output_path = os.path.abspath(os.path.join('__file__', '../..', 'figures',
                                         'svg', 'Figure_4_NoCommas.svg'))
plot_heatmap(funder_ancestry, funder_parent, output_path)

Who are the Authors?

Who are the most cited authors? What is their 'GWAS H-Index' calculated based only on their papers in the GWAS catalogue? This assumes unique forename + surname combinations, and that the same author is recorded consistently across studies. First a couple of snippets:

In [33]:
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
print('There are a total of ' + str(len(AuthorMaster)) +
      ' "authorship contributions".')
print('These contributions are made by ' +
      str(len(AuthorMaster['AUTHORNAME'].unique())) + ' unique authors.')
print('There are a total of ' +
      str(len(AuthorMaster) /
          len(AuthorMaster['PUBMEDID'].drop_duplicates())) +
      ' "authorship contributions per paper".')
print('The study with the most number of authors has ' +
      str(AuthorMaster.groupby(['PUBMEDID']).size().max()) +
      ' authors (PubMed ID: ' +
      str(AuthorMaster.groupby(['PUBMEDID']).size().idxmax()) + ')')
There are a total of 119684 "authorship contributions".
These contributions are made by 39397 unique authors.
There are a total of 33.79949166902005 "authorship contributions per paper".
The study with the most number of authors has 559 authors (PubMed ID: 22479202)

Calculate 'GWAS-H' indexes

Lets then calculate our GWAS-H indexes using citation data and paper counts for unique authors.

In [34]:
CitationCounts = pd.read_csv(os.path.abspath(
    os.path.join('__file__', "../..",
                 "data", "PUBMED", 'Pubmed_Cites.csv')))
AuthorMaster_withcites = pd.merge(
    AuthorMaster, CitationCounts, on=['PUBMEDID'], how='left')
allauthors_formerge = pd.DataFrame(
    AuthorMaster_withcites[['AUTHORNAME', 'citedByCount']])
allauthors_papercount = pd.DataFrame(
    allauthors_formerge['AUTHORNAME'].value_counts())
allauthors_citecount = pd.DataFrame(
    allauthors_formerge.groupby(by='AUTHORNAME')['citedByCount'].sum())
allauthors_merged = pd.merge(
    allauthors_papercount, allauthors_citecount, left_index=True,
    right_index=True)
allauthors_merged.columns = ['Papers', 'citedByCount']
allauthors_merged = allauthors_merged.sort_values(
    by='citedByCount', ascending=False)
allauthors_merged['GWAS-H'] = 0
counter = 0
for author in allauthors_merged.index:
    counter += 1
    temp = AuthorMaster_withcites[AuthorMaster_withcites['AUTHORNAME'] ==
                                  author].sort_values(by='citedByCount',
                                                      ascending=False).dropna()
    temp = temp.reset_index()
    temp = temp.drop('index', 1)
    for pubnumber in range(0, len(temp)):
        if pubnumber + 1 > temp['citedByCount'][pubnumber]:
            break
        else:
            allauthors_merged.loc[author, ('GWAS-H')] = int(pubnumber+1)
    sys.stdout.write('\r' + 'Calculating GWAS H-indices: finished ' +
                     str(counter) + ' of ' +
                     str(len(allauthors_merged.index)) + ' authors...')
allauthors_citecount.reset_index(inplace=True)
allauthors_papercount.reset_index(inplace=True)
allauthors_papercount.rename(
    columns={'AUTHORNAME': 'PAPERCOUNT', 'index': 'AUTHORNAME', },
    inplace=True)
Calculating GWAS H-indices: finished 39397 of 39397 authors...

Compare Author Metrics Between Genders

Calculate author centralities

Lets next calculate some measures of authorship centrality with Network-X. Note: this can take some time on slow computers. To get it to run faster, change the CitedByCount to be something like 100 to filter for authors with a minimum of a hundred citations only (or change PAPERCOUNT>1)

In [35]:
AuthorMaster_sumcites = pd.merge(AuthorMaster, allauthors_citecount,
                                 left_on='AUTHORNAME', right_on='AUTHORNAME',
                                 how='left')
AuthorMaster_sumcitespapercount = pd.merge(AuthorMaster_sumcites,
                                           allauthors_papercount,
                                           left_on='AUTHORNAME',
                                           right_on='AUTHORNAME', how='left')
AuthorMaster_sumcitespapercount_filter_cites = AuthorMaster_sumcitespapercount[
    AuthorMaster_sumcitespapercount['citedByCount'] > 10]
AuthorMaster_sumcitespapercount_filtered =\
    AuthorMaster_sumcitespapercount_filter_cites[
        AuthorMaster_sumcitespapercount_filter_cites['PAPERCOUNT'] > 1]

G = nx.Graph()
counter = 0
alledges = []
for paper in AuthorMaster_sumcitespapercount_filtered['PUBMEDID'].unique():
    counter += 1
    temp = AuthorMaster_sumcitespapercount_filtered[
        AuthorMaster_sumcitespapercount_filtered['PUBMEDID'] == paper]
    if len(temp) > 1:
        templist = list(itertools.combinations(temp.AUTHORNAME, 2))
        for edge in templist:
            alledges.append(edge)
G.add_edges_from(list(set(alledges)))

print('This gives us a network with ' + str(len(G)) +
      ' nodes.\nThese are unique authors with >1 paper and >10 cites')

betcent = pd.Series(nx.betweenness_centrality(G), name='Betweenness')
allauthors_merged = allauthors_merged.merge(
    betcent.to_frame(), left_index=True, right_index=True)

degcent = pd.Series(nx.degree_centrality(G), name='Degree')
allauthors_merged = allauthors_merged.merge(
    degcent.to_frame(), left_index=True, right_index=True)
This gives us a network with 14905 nodes.
These are unique authors with >1 paper and >10 cites

We can then merge all this data with some data collected manually from web searches related to their country of employment, their current employer, etc. We then rank in three different ways to analyze overlap between the three metrics.

In [36]:
authorsupplemental = pd.read_csv(os.path.abspath(
    os.path.join('__file__', '../..',
                 'data', 'Support', 'Author_Supplmentary.csv')),
     encoding='latin-1', index_col=0)
allauthors_merged_withsupp = pd.merge(allauthors_merged,
                                      authorsupplemental,
                                      left_index=True, right_index=True,
                                      how='left',)
allauthors_merged_withsupp['Betweenness'] = round(allauthors_merged_withsupp['Betweenness'],3)
allauthors_merged_withsupp['Degree'] = round(allauthors_merged_withsupp['Degree'],3)
allauthors_merged_withsupp['CitePercentile'] = round(allauthors_merged_withsupp.citedByCount.rank(pct=True),4)
allauthors_merged_withsupp[['Papers','citedByCount',
                           'GWAS-H','Betweenness',
                           'Degree','Country',
                           'Institution',
                           'CitePercentile']].sort_values(by='GWAS-H',
                                                         ascending=False).to_csv(os.path.abspath(
                                                                                 os.path.join('__file__',
                                                                                              '../..',
                                                                                              'tables',
                                                                                              'Authors.csv')))
allauthors_merged_withsupp.sort_values(by='GWAS-H', ascending=False).head(10)
Out[36]:
Papers citedByCount GWAS-H Betweenness Degree Country Institution CitePercentile
Kari Stefansson 173 27063 84 0.020 0.305 Iceland deCode Genetics 1.0000
Albert Hofman 263 24995 75 0.013 0.346 U.S. University of Harvard 0.9999
Unnur Thorsteinsdottir 138 23223 75 0.006 0.238 Iceland deCode Genetics 0.9999
Andre G Uitterlinden 276 22807 75 0.018 0.367 Netherlands Erasmus MC 0.9998
Cornelia M van Duijn 186 20414 70 0.008 0.294 Netherlands UMC Rotterdam 0.9997
Gudmar Thorleifsson 116 20017 69 0.006 0.229 Iceland deCode Genetics 0.9995
Panos Deloukas 107 19942 68 0.009 0.232 U.K. Queen Mary UoL 0.9994
H-Erich Wichmann 110 19961 68 0.007 0.218 Germany Helmholtz-Muenchen 0.9995
Christian Gieger 164 22085 68 0.012 0.273 Germany Helmholtz-Muenchen 0.9997
Fernando Rivadeneira 194 17636 63 0.009 0.283 Netherlands UMC Rotterdam 0.9992
In [37]:
allauthors_merged_withsupp.sort_values(by='Degree',
                                       ascending=False).head(10)
Out[37]:
Papers citedByCount GWAS-H Betweenness Degree Country Institution CitePercentile
Andre G Uitterlinden 276 22807 75 0.018 0.367 Netherlands Erasmus MC 0.9998
Albert Hofman 263 24995 75 0.013 0.346 U.S. University of Harvard 0.9999
Kari Stefansson 173 27063 84 0.020 0.305 Iceland deCode Genetics 1.0000
Cornelia M van Duijn 186 20414 70 0.008 0.294 Netherlands UMC Rotterdam 0.9997
Fernando Rivadeneira 194 17636 63 0.009 0.283 Netherlands UMC Rotterdam 0.9992
Jerome I Rotter 183 17438 59 0.011 0.278 U.S. UCLA 0.9991
Christian Gieger 164 22085 68 0.012 0.273 Germany Helmholtz-Muenchen 0.9997
Bruce M Psaty 165 12659 57 0.005 0.260 U.S. University of Washington 0.9964
Nicholas G Martin 158 13283 58 0.009 0.258 Australia QIMR Berghofer 0.9971
Vilmundur Gudnason 126 14985 58 0.003 0.256 Iceland I.H.A. 0.9982
In [38]:
allauthors_merged_withsupp.sort_values(by='citedByCount',
                                       ascending=False).head(10)
Out[38]:
Papers citedByCount GWAS-H Betweenness Degree Country Institution CitePercentile
Kari Stefansson 173 27063 84 0.020 0.305 Iceland deCode Genetics 1.0000
Albert Hofman 263 24995 75 0.013 0.346 U.S. University of Harvard 0.9999
Unnur Thorsteinsdottir 138 23223 75 0.006 0.238 Iceland deCode Genetics 0.9999
Andre G Uitterlinden 276 22807 75 0.018 0.367 Netherlands Erasmus MC 0.9998
Christian Gieger 164 22085 68 0.012 0.273 Germany Helmholtz-Muenchen 0.9997
Cornelia M van Duijn 186 20414 70 0.008 0.294 Netherlands UMC Rotterdam 0.9997
Mark I McCarthy 97 20339 62 0.003 0.186 U.K. University of Oxford 0.9996
Gudmar Thorleifsson 116 20017 69 0.006 0.229 Iceland deCode Genetics 0.9995
H-Erich Wichmann 110 19961 68 0.007 0.218 Germany Helmholtz-Muenchen 0.9995
Panos Deloukas 107 19942 68 0.009 0.232 U.K. Queen Mary UoL 0.9994

And then make a simple correlation matrix to check how highly related these metrics are:

In [39]:
allauthors_merged_withsupp[['citedByCount', 'GWAS-H',
                            'Betweenness', 'Degree', 'Papers']].corr()
Out[39]:
citedByCount GWAS-H Betweenness Degree Papers
citedByCount 1.000000 0.893697 0.501406 0.818449 0.846055
GWAS-H 0.893697 1.000000 0.525005 0.879581 0.943799
Betweenness 0.501406 0.525005 1.000000 0.476674 0.637231
Degree 0.818449 0.879581 0.476674 1.000000 0.850934
Papers 0.846055 0.943799 0.637231 0.850934 1.000000
In [40]:
print('There are a total of ' +
      str(len(allauthors_merged_withsupp)) + ' authors in the table.')
print('The person with the highest G-WAS H-Index is: ' +
      allauthors_merged_withsupp['GWAS-H'].idxmax())
print('The person with the highest Degree is: ' +
      allauthors_merged_withsupp['Degree'].idxmax())
print('The person with the highest citedByCount is: ' +
      allauthors_merged_withsupp['citedByCount'].idxmax())
print('The person with the most number of Papers is: ' +
      allauthors_merged_withsupp['Papers'].idxmax())
There are a total of 14906 authors in the table.
The person with the highest G-WAS H-Index is: Kari Stefansson
The person with the highest Degree is: Andre G Uitterlinden
The person with the highest citedByCount is: Kari Stefansson
The person with the most number of Papers is: Andre G Uitterlinden

Author gender

Lets consider the gender of each author by analyzing their forenames using the genderguesser library. We can directly compare our results to this wonderful paper. One of the key assumptions made here is to combine all 'mostly_' male and female names into their respective male/female categories.

In [41]:
gendet = gender.Detector()
FunderInfo, AbstractInfo, AuthorMaster = load_pubmed_data()
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()

AuthorCounts = AuthorMaster.groupby(['PUBMEDID'])['AUTHORNAME'].count(
).to_frame().reset_index().rename(columns={"AUTHORNAME": "Author Count"})
AuthorMaster = pd.merge(AuthorMaster, AuthorCounts, how='left', on='PUBMEDID')
AuthorMaster['CleanForename'] = AuthorMaster['FORENAME'].map(
    lambda x: clean_names(x))
AuthorMaster['CleanGender'] = AuthorMaster['CleanForename'].map(
    lambda x: gendet.get_gender(x))
AuthorMaster['MaleFemale'] = AuthorMaster['CleanGender'].str.replace('mostly_',
                                                                     '')
AuthorMaster['isfemale'] = np.where(
    AuthorMaster['MaleFemale'] == 'female', 1, 0)
AuthorFirst = AuthorMaster[AuthorMaster['Author Count']
                           > 4].drop_duplicates(subset='PUBMEDID', keep='first')
AuthorLast = AuthorMaster[AuthorMaster['Author Count']
                          > 4].drop_duplicates(subset='PUBMEDID', keep='last')
AuthorUnique = AuthorMaster.drop_duplicates(subset='AUTHORNAME')

for gender_ in AuthorMaster['CleanGender'].unique():
    print(str(round((len(AuthorMaster[AuthorMaster['CleanGender'] == gender_]) /
                     len(AuthorMaster['CleanGender'])) * 100, 2)) + '%' +
          ' ' + gender_ + ' authorship contributions in total')
    print(str(round((len(AuthorUnique[AuthorUnique['CleanGender'] == gender_]) /
                     len(AuthorUnique['CleanGender'])) * 100, 2)) + '%' +
          ' ' + gender_ + ' authors contributions in total')
    print(str(round((len(AuthorFirst[AuthorFirst['CleanGender'] == gender_]) /
                     len(AuthorFirst['CleanGender'])) * 100, 2)) + '%' +
          ' ' + gender_ + ' first authors in total')
    print(str(round((len(AuthorLast[AuthorLast['CleanGender'] == gender_]) /
                     len(AuthorFirst['CleanGender'])) * 100, 2)) + '%' +
          ' ' + gender_ + ' last authors in total')

print('\nPercent of male author contributions: ' +
      str(round(len(AuthorMaster[AuthorMaster['MaleFemale'] == 'male']) /
                (len(AuthorMaster[(AuthorMaster['MaleFemale'] == 'male') |
                                  (AuthorMaster['MaleFemale'] == 'female')])) *
                100, 3)) + '%')

print('Percent of unique male authors: ' +
      str(round(len(AuthorUnique[AuthorUnique['MaleFemale'] == 'male']) /
                (len(AuthorUnique[(AuthorUnique['MaleFemale'] == 'male') |
                                  (AuthorUnique['MaleFemale'] == 'female')])) *
                100, 3)) + '%')

print('Percent of male first authors: ' +
      str(round(len(AuthorFirst[AuthorFirst['MaleFemale'] == 'male']) /
                (len(AuthorFirst[(AuthorFirst['MaleFemale'] == 'male') |
                                 (AuthorFirst['MaleFemale'] == 'female')])) *
                100, 3)) + '%')

print('Percent of male last authors: ' +
      str(round(len(AuthorLast[AuthorLast['MaleFemale'] == 'male']) /
                (len(AuthorLast[(AuthorLast['MaleFemale'] == 'male') |
                                (AuthorLast['MaleFemale'] == 'female')])) *
                100, 3)) + '%')

AuthorMaster_filtered = AuthorMaster[(AuthorMaster['MaleFemale'].str.contains(
    'male')) & (AuthorMaster['Author Count'] > 4)]
AuthorMaster_filtered_merged_bydisease = pd.merge(
    AuthorMaster_filtered, Cat_Studies[['PUBMEDID', 'DISEASE/TRAIT']],
    how='left', on='PUBMEDID')

meanfemales_bydisease = AuthorMaster_filtered_merged_bydisease.groupby(
    ['DISEASE/TRAIT'])['isfemale'].mean().to_frame()
numberofstudies_bydisease = Cat_Studies.groupby(
    ['DISEASE/TRAIT'])['DISEASE/TRAIT'].count().to_frame()
mergedgender_andcount_bydisease = pd.merge(
    numberofstudies_bydisease, meanfemales_bydisease, left_index=True,
    right_index=True)
mergedgender_andcount_bydisease = mergedgender_andcount_bydisease.sort_values(
    by='DISEASE/TRAIT', ascending=False)[0:10]
holdstring = 'Percent of authorships across 10 most commonly studied diseases:\n'
for index, row in mergedgender_andcount_bydisease.iterrows():
    holdstring = holdstring + \
        index.title() + ' (' + str(round(row['isfemale'], 3) * 100) + '%)\n'
print('\n' + holdstring[:-1])
21.21% unknown authorship contributions in total
28.9% unknown authors contributions in total
28.05% unknown first authors in total
24.03% unknown last authors in total
26.1% female authorship contributions in total
25.16% female authors contributions in total
27.21% female first authors in total
19.47% female last authors in total
45.49% male authorship contributions in total
37.81% male authors contributions in total
34.81% male first authors in total
49.28% male last authors in total
1.35% mostly_male authorship contributions in total
1.52% mostly_male authors contributions in total
1.39% mostly_male first authors in total
1.73% mostly_male last authors in total
1.38% mostly_female authorship contributions in total
1.48% mostly_female authors contributions in total
1.3% mostly_female first authors in total
1.79% mostly_female last authors in total
4.48% andy authorship contributions in total
5.12% andy authors contributions in total
7.25% andy first authors in total
3.7% andy last authors in total

Percent of male author contributions: 63.019%
Percent of unique male authors: 59.618%
Percent of male first authors: 55.937%
Percent of male last authors: 70.584%

Percent of authorships across 10 most commonly studied diseases:
Type 2 Diabetes (32.800000000000004%)
Breast Cancer (49.8%)
Body Mass Index (39.1%)
Colorectal Cancer (34.2%)
Schizophrenia (29.599999999999998%)
Prostate Cancer (40.6%)
Alzheimer'S Disease (40.1%)
Asthma (36.8%)
Height (37.2%)
Hdl Cholesterol (35.9%)
In [42]:
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
EFO_Parent_Mapped, EFO_Parent_Mapped_NoCommas = EFO_parent_mapper(
    Cat_Studies, Cat_Ancestry_groupedbyN)

ranks, countstudiesperEFO, meanfemale, AuthorMaster_EFO, AuthorMaster_Parent = make_meanfemale_andranks(AuthorMaster,
                                                                                    EFO_Parent_Mapped)
output_path = os.path.abspath(os.path.join('__file__', "../..", "figures",
                                           'svg', 'Sup_Figure_1_Commas.svg'))
boxswarm_plot(meanfemale, ranks, output_path)

AuthorMaster_Parent.reset_index(inplace=True)
holdstring = 'Get the numbers for Parent Terms:\n'
for index, row in AuthorMaster_Parent.iterrows():
    holdstring = holdstring + row['Parent term'].title() + ' (' \
        + str(round(row['isfemale'], 3) * 100) + '%)\n'
print('\n' + holdstring[:-2])
holdstring = 'Get the numbers for Trait Terms (figure above):\n'
for index, row in AuthorMaster_EFO[
    AuthorMaster_EFO['EFO term'].isin(countstudiesperEFO)].sort_values(
        by='isfemale', ascending=False).iterrows():
    holdstring = holdstring + \
        row['EFO term'].title() + ' (' + str(round(row['isfemale'], 3) *
                                             100) + '%)\n'
print('\n' + holdstring[:-1])
Get the numbers for Parent Terms:
Biological Process (40.8%)
Body Meas. (41.199999999999996%)
Cancer (46.6%)
Cardiovascular Disease (33.0%)
Cardiovascular Meas. (32.7%)
Digestive System Disorder (30.7%)
Hematological Meas. (33.7%)
Immune System Disorder (36.199999999999996%)
Inflammatory Meas. (32.5%)
Lipid/Lipoprotein Meas. (35.699999999999996%)
Liver Enzyme Meas. (35.199999999999996%)
Metabolic Disorder (33.6%)
Neurological Disorder (35.4%)
Other Disease (34.300000000000004%)
Other Meas. (38.4%)
Other Trait (35.699999999999996%)
Response To Drug (33.900000000000006%

Get the numbers for Trait Terms (figure above):
Breast Carcinoma (50.6%)
Prostate Carcinoma (41.8%)
Body Mass Index (41.0%)
Bipolar Disorder (37.1%)
Alzheimers Disease (36.7%)
Colorectal Cancer (36.5%)
Asthma (36.4%)
Triglyceride Meas. (36.199999999999996%)
Unipolar Depression (35.199999999999996%)
Hd Lc Measurement (35.199999999999996%)
Rheumatoid Arthritis (35.099999999999994%)
Ld Lc Measurement (34.599999999999994%)
Type Ii Diabetes Mellitus (33.1%)
Schizophrenia (31.0%)

Lets do exactly the same thing as above, but drop the fields with commas in:

In [43]:
ranks, countstudiesperEFO, meanfemale, AuthorMaster_EFO, AuthorMaster_Parent = make_meanfemale_andranks(AuthorMaster, EFO_Parent_Mapped_NoCommas)
output_path = os.path.abspath(os.path.join('__file__', "../..", "figures",
                                           'svg', 'Sup_Figure_1_NoCommas.svg'))
boxswarm_plot(meanfemale, ranks, output_path)

AuthorMaster_Parent.reset_index(inplace=True)
holdstring = 'Get the numbers for Parent Terms:\n'
for index, row in AuthorMaster_Parent.iterrows():
    holdstring = holdstring + row['Parent term'].title() + ' (' \
        + str(round(row['isfemale'], 3) * 100) + '%)\n'
print('\n' + holdstring[:-2])
holdstring = 'Get the numbers for Trait Terms (figure above):\n'
for index, row in AuthorMaster_EFO[
    AuthorMaster_EFO['EFO term'].isin(countstudiesperEFO)].sort_values(
        by='isfemale', ascending=False).iterrows():
    holdstring = holdstring + \
        row['EFO term'].title() + ' (' + str(round(row['isfemale'], 3) *
                                             100) + '%)\n'
print('\n' + holdstring[:-1])
Get the numbers for Parent Terms:
Biological Process (37.1%)
Body Meas. (40.0%)
Cancer (47.8%)
Cardiovascular Disease (30.7%)
Cardiovascular Meas. (32.1%)
Digestive System Disorder (32.2%)
Hematological Meas. (34.0%)
Immune System Disorder (35.699999999999996%)
Inflammatory Meas. (28.999999999999996%)
Lipid/Lipoprotein Meas. (36.199999999999996%)
Liver Enzyme Meas. (36.1%)
Metabolic Disorder (33.900000000000006%)
Neurological Disorder (35.4%)
Other Disease (34.5%)
Other Meas. (36.6%)
Other Trait (35.6%)
Response To Drug (33.300000000000004%

Get the numbers for Trait Terms (figure above):
Breast Carcinoma (50.5%)
Prostate Carcinoma (41.699999999999996%)
Body Mass Index (39.800000000000004%)
Alzheimers Disease (36.6%)
Triglyceride Meas. (36.3%)
Asthma (36.199999999999996%)
Bone Density (35.6%)
Hd Lc Measurement (35.3%)
Unipolar Depression (35.3%)
Ld Lc Measurement (34.699999999999996%)
Colorectal Cancer (34.1%)
Bipolar Disorder (34.0%)
Type Ii Diabetes Mellitus (32.800000000000004%)
Schizophrenia (30.0%)

Compare Author Metrics Between Genders

In [44]:
allauthors_merged_reset = allauthors_merged.reset_index().rename(columns={'index': 'Name'})
allauthors_merged_reset['CleanForename'] = allauthors_merged_reset['Name'].map(lambda x: x.split(' ')[0]) 
allauthors_merged_reset['CleanGender'] = allauthors_merged_reset['CleanForename'].map(
    lambda x: gendet.get_gender(x))
allauthors_merged_reset['MaleFemale'] = allauthors_merged_reset['CleanGender'].str.replace('mostly_',
                                                                     '')
allauthors_merged_reset['isfemale'] = np.where(
   allauthors_merged_reset['MaleFemale'] == 'female', 1, 0)

print('The average GWAS-H Index for females is: ' +
      str(round(allauthors_merged_reset[allauthors_merged_reset['MaleFemale']=='female']['GWAS-H'].mean(),3)))
print('The average GWAS-H Index for males is: ' +
      str(round(allauthors_merged_reset[allauthors_merged_reset['MaleFemale']=='male']['GWAS-H'].mean(),3)))
print('The average number of papers published by females is: ' +
      str(round(allauthors_merged_reset[allauthors_merged_reset['MaleFemale']=='female']['Papers'].mean(),3)))
print('The average number of papers published by males is: ' +
     str(round(allauthors_merged_reset[allauthors_merged_reset['MaleFemale']=='male']['Papers'].mean(),3)))
print('The average number of citations for females is: ' +
      str(round(allauthors_merged_reset[allauthors_merged_reset['MaleFemale']=='female']['citedByCount'].mean(),3)))
print('The average number of citations for males is: ' +
      str(round(allauthors_merged_reset[allauthors_merged_reset['MaleFemale']=='male']['citedByCount'].mean(),3)))
The average GWAS-H Index for females is: 4.826
The average GWAS-H Index for males is: 5.327
The average number of papers published by females is: 6.089
The average number of papers published by males is: 7.112
The average number of citations for females is: 642.197
The average number of citations for males is: 776.021

Cohorts and Consortium

Consortium/collectives

Lets use the PubMed database returns on collectives to analyze which consortium and study groups are acknowledged:

In [45]:
pubmed_collectives = pd.read_csv(os.path.abspath(
                                 os.path.join('__file__', '../..',
                                              'data', 'PUBMED',
                                              'Pubmed_CollectiveInfo.csv')),
                                 encoding='latin-1')
collect_dict = pd.read_csv(os.path.abspath(
                           os.path.join(
                               '__file__', '../..', 'data',
                               'Support', 'Collectives',
                               'Collective_Dictionary.csv')),
                           encoding='latin-1')
pubmed_collectives['COLLECTIVE'] = pubmed_collectives['COLLECTIVE'].str.strip()
pubmed_collectives['COLLECTIVE'] = pubmed_collectives['COLLECTIVE'].str.lower()
collect_dict['COLLECTIVE'] = collect_dict['COLLECTIVE'].str.strip(
).str.lower()
merged_collect = pd.merge(
    pubmed_collectives, collect_dict, how='left', on='COLLECTIVE')
if len(merged_collect[merged_collect['Clean_Name'].isnull()]) > 0:
    print('Danger! Correct collectives in Collective_Unverified.csv ' +
          'and add to the dictionary\n')
    unverified_collect = merged_collect[merged_collect['Clean_Name'].isnull(
    )]
    unverified_collect.to_csv(os.path.abspath(
                              os.path.join('__file__', '../..',
                                           'data', 'Support', 'Collectives',
                                           'Collective_Unverified.csv')), encoding='latin-1')
else:
    print('The consortium dictionary is up to date!\n')
consortiumlist = []
for index, row in merged_collect.iterrows():
    if pd.isnull(row['Clean_Name']) is False:
        for consortium in row['Clean_Name'].split(';'):
            consortiumlist.append(consortium)
clean_collectives = pd.DataFrame(consortiumlist, columns=['Consortium'])
groupby_collectives = clean_collectives.groupby(
    ['Consortium'])['Consortium'].count().sort_values(ascending=False)
holdstring = 'The most frequently seen Consortium are:\n'
for index, row in groupby_collectives.to_frame()[0:10].iterrows():
    holdstring = holdstring + index + ' (' + str(row['Consortium']) + ')\n'
print(holdstring[:-2] + ')')

print('\nWe have seen a total of ' +
      str(len(groupby_collectives.to_frame())) + ' different collectives!')
print('A total of ' + str(len(pubmed_collectives['PUBMEDID'].drop_duplicates(
))) + ' papers are contributed to by at least one collective.')
print('A total of ' + str(groupby_collectives.sum()) +
      ' collective contributions are made.')
The consortium dictionary is up to date!

The most frequently seen Consortium are:
Wellcome Trust Case Control Consortium (48)
CHARGE (44)
Wellcome Trust Case Control Consortium 2 (36)
LifeLines Cohort Study (30)
DIAGRAM (29)
Alzheimer's Disease Neuroimaging Initiative (25)
CARDIOGRAM (24)
MAGIC (20)
GIANT Consortium (18)
kConFab (16)

We have seen a total of 668 different collectives!
A total of 824 papers are contributed to by at least one collective.
A total of 1624 collective contributions are made.

Datasets

Lets now simply wrangle together some of the manually collated data and consider the most frequently observed cohorts, with the aim being to analyze the resampling issue discussed earlier in the literature.

In [46]:
finaldataset = pd.read_csv(os.path.abspath(
    os.path.join('__file__', '../..', 'data',
                 'Support', 'Cohorts',
                 'manually_parsed_data.csv')),
    encoding='latin-1')
mylist = []
for index, row in finaldataset.iterrows():
    mylist = mylist + row['DATASETS'].split(';')
mylist = [x.strip().upper() for x in mylist]
df1 = pd.DataFrame({'Cohorts': mylist})
reader = csv.reader(open(os.path.abspath(os.path.join(
    '__file__', '../..', 'data', 'Support', 'Cohorts',
    'Dictionary_cohorts.csv')), 'r'))
d = {}
for row in reader:
    k, v = row
    d[k] = v
for key in d:
    df1['Cohorts'].replace(key, d[key], inplace=True)
df1['Cohorts'] = df1['Cohorts'].str.replace(
    'Rotterdam Study I (RS-I)', 'Rotterdam Study (RS)')
df1['Cohorts'].replace('&', 'and', inplace=True)
df1['Cohorts'] = df1['Cohorts'].str.upper()
newdf = pd.DataFrame(df1.groupby(['Cohorts'])[
                     'Cohorts'].count(), columns=['Count'])
newdf = pd.DataFrame({'Count': df1.groupby(['Cohorts']).size()}).reset_index()

Then merge this with manually collated data:

In [47]:
manual_cohort_for_merge = pd.read_csv(os.path.abspath(
    os.path.join('__file__', "../..", "data",
                 "Support", "Cohorts",
                 "manual_cohort_for_merge.csv")),
    encoding='latin-1', index_col=False)
merged_manual = pd.merge(newdf, manual_cohort_for_merge,
                         how='left', on='Cohorts')
merged_manual.set_index('Cohorts').sort_values(
    by='Count', ascending=False).drop('NO NAME').head(10).style
Out[47]:
Count N Country of Recruitment Age Range Study Design Female (%)
Cohorts
ROTTERDAM STUDY (RS) 398 14926 Netherlands 55-106 Prospective cohort 57.14
COOPERATIVE HEALTH RESEARCH IN THE REGION OF AUGSBURG (KORA) 255 18079 Germany 24-75 Population-based 50%
FRAMINGHAM HEART STUDY (FHS) 207 15447 US 30-62 (original) now 5-85 Prospective cohort, three generation 53.6
ATHEROSCLEROSIS RISK IN COMMUNITIES STUDY (ARIC) 204 15792 US 45-64 Prospective cohort, Community 55.15
CARDIOVASCULAR HEALTH STUDY (CHS) 179 5888 US 65+ Prospective cohort 57.64
BRITISH 1958 BIRTH COHORT STUDY (B58C) 156 17634 UK 0+ Prospective birth cohort 48
UK ADULT TWIN REGISTER (TWINSUK) 140 12000 UK, Ireland 18-97 Longitudinal entry at various times 84%
EUROPEAN PROSPECTIVE INVESTIGATION INTO CANCER (EPIC) 132 521330 10 EU countries 21-83 (varies by country) Prospective cohort 70.57
NURSES HEALTH STUDY (NHS) 129 121700 US 30-55 Prospective cohort 100
STUDY OF HEALTH IN POMERANIA (SHIP) 127 4308 Germany 20-79 Prospective cohort 50.84

Lets now print out a list of the 50 most commonly used datasets in case we need to reference any specific particularily prominent ones (e.g. deCODE, UKBioBank etc):

In [48]:
merged_manual=merged_manual.set_index('Cohorts').sort_values(by='Count', ascending=False)
holder='The 30 most commonly used datasets are:\n\n'
for index, row in merged_manual[0:30].iterrows():
    holder = holder + \
        str(index) + ' (' + str(row['Count']) + ' times)\n'
print(holder[:-2])
merged_manual[['Count']].to_csv(os.path.abspath(
                                os.path.join('__file__',
                                             '../..',
                                             'data',
                                             'Support',
                                             'Cohorts',
                                             'Cohort_Count.csv')), encoding='latin-1')
The 30 most commonly used datasets are:

ROTTERDAM STUDY (RS) (398 times)
NO NAME (297 times)
COOPERATIVE HEALTH RESEARCH IN THE REGION OF AUGSBURG (KORA) (255 times)
FRAMINGHAM HEART STUDY (FHS) (207 times)
ATHEROSCLEROSIS RISK IN COMMUNITIES STUDY (ARIC) (204 times)
CARDIOVASCULAR HEALTH STUDY (CHS) (179 times)
BRITISH 1958 BIRTH COHORT STUDY (B58C) (156 times)
UK ADULT TWIN REGISTER (TWINSUK) (140 times)
EUROPEAN PROSPECTIVE INVESTIGATION INTO CANCER (EPIC) (132 times)
NURSES HEALTH STUDY (NHS) (129 times)
STUDY OF HEALTH IN POMERANIA (SHIP) (127 times)
AGING GENE-ENVIRONMENT SUSCEPTIBILITY - REYKJAVIK STUDY (AGES) (119 times)
DECODE GENETICS (DECODE) (113 times)
ERASMUS RUCPHEN FAMILY STUDY (ERF) (104 times)
WELLCOME TRUST CASE CONTROL CONSORTIUM (WTCCC) (103 times)
MULTI-ETHNIC STUDY OF ATHEROSCLEROSIS (MESA) (98 times)
HEALTH, AGING, AND BODY COMPOSITION STUDY (HEALTH ABC) (92 times)
NORTHERN FINNISH BIRTH COHORT STUDY (NFBC) (87 times)
NETHERLANDS TWIN REGISTRY (NTR) (78 times)
ORKNEY COMPLEX DISEASE STUDY (ORCADES) (78 times)
LOTHIAN BIRTH COHORT (LBC) (78 times)
ESTONIAN GENOME CENTER OF THE UNIVERSITY OF TARTU (EGCUT) (75 times)
AVON LONGITUDINAL STUDY OF PARENTS AND CHILDREN (ALSPAC) (75 times)
COHORTE LAUSANNOISE (COLAUS) (75 times)
THE INVECCHIARE IN CHIANTI (INCHIANTI) (70 times)
WOMENS GENOME HEALTH STUDY (WGHS) (69 times)
HEALTH PROFESSIONAL FOLLOW-UP STUDY (HPFS) (69 times)
SARDINIA STUDY OF AGING (SARDINIA) (68 times)
BIOBANK JAPAN PROJECT (BBJ) (68 times)
CARDIOVASCULAR RISK IN YOUNG FINNS STUDY (YFS) (67 times
In [49]:
Cat_Studies, Cat_Ancestry, Cat_Ancestry_groupedbyN, Cat_Full = load_gwas_cat()
PUBMED_by_N = pd.DataFrame(Cat_Ancestry.groupby(['PUBMEDID'])['N'].agg('sum').sort_values(ascending=False))
pubmed_manual = pd.merge(PUBMED_by_N.reset_index(), finaldataset, left_on='PUBMEDID',
                                 right_on='PUBMEDID', how='left')
print('We have counted ' + str(len(df1['Cohorts'])) + ' (non-unique) cohorts in total')
print('We have counted ' + str(len(newdf)) + ' unique cohorts in total.')
print("We're currently missing " +
      str(len(set(PUBMED_by_N[0:1000].index)-set(finaldataset['PUBMEDID']))) + 
     " of the biggest 1000 papers grouped by summed ancestry!")
print('We currently have coverage for ' + 
      str(round((pubmed_manual[pubmed_manual['In Charge'].notnull()]['N'].sum()/
          PUBMED_by_N['N'].sum())*100,2)) +
      '% of all papers by N.')
print('We currently have coverage for ' + 
      str(round((len(finaldataset)/len(PUBMED_by_N))*100,2)) +
      '% of all papers in absolute.')
pubmed_manual.to_csv(os.path.abspath(
                             os.path.join('__file__',
                                          '../..',
                                          'data',
                                          'Support',
                                          'Cohorts',
                                          'Merged_Cohorts_Manual_Curation.csv')),
                     encoding='latin-1')
not_curated = pubmed_manual[~pubmed_manual['In Charge'].notnull()]['PUBMEDID']
not_curated.to_csv(os.path.abspath(os.path.join('__file__', '../..', 'data',
                                                'Support', 'Cohorts',
                                                'Outstanding_PUBMEDIDs_To_Curate.csv')),
                   encoding='latin-1')
We have counted 11851 (non-unique) cohorts in total
We have counted 2000 unique cohorts in total.
We're currently missing 38 of the biggest 1000 papers grouped by summed ancestry!
We currently have coverage for 87.76% of all papers by N.
We currently have coverage for 34.84% of all papers in absolute.